{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "5fa5cd08",
   "metadata": {},
   "source": [
    "# Tutorial 02 - Loading Data from Unstructured Directory\n",
    "\n",
    "In Tutorial 01 we assumed a specific folder structure to load the audio files and create a PyTorch Dataset. This is restrictive as in most cases the dataset comes in a folder containing all audio files and the individual splits are determined by some other structure (e.g., `csv` or `json` files, etc.). In this Tutorial we demonstrate an alternative and more Pythonic-way to load your data and create the Audio Classification Dataset."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a1fd5904",
   "metadata": {},
   "source": [
    "## 1. Dataset Downloading & Inspection\n",
    "\n",
    "For the purposes of this Tutorial we use the SpeechCommands dataset, we use a small version of the dataset consisting of 12 spoken english commands (e.g., \"down\", \"go\", \"left\", etc.) from various speakers. More information about the dataset can be found in the [HEAR](https://arxiv.org/abs/2203.03022) evaluation benchmark dataset. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "be51b64c",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "--2026-04-17 15:24:25--  https://zenodo.org/records/5887964/files/hear2021-speech_commands-v0.0.2-5h-48000.tar.gz?download=1\n",
      "Resolving zenodo.org (zenodo.org)... 188.185.43.153, 188.185.48.75, 188.184.103.118, ...\n",
      "Connecting to zenodo.org (zenodo.org)|188.185.43.153|:443... connected.\n",
      "HTTP request sent, awaiting response... 200 OK\n",
      "Length: 1430299345 (1.3G) [application/octet-stream]\n",
      "Saving to: ‘hear2021-speech_commands-v0.0.2-5h-48000.tar.gz?download=1’\n",
      "\n",
      "hear2021-speech_com 100%[===================>]   1.33G  6.69MB/s    in 4m 15s  \n",
      "\n",
      "2026-04-17 15:28:40 (5.35 MB/s) - ‘hear2021-speech_commands-v0.0.2-5h-48000.tar.gz?download=1’ saved [1430299345/1430299345]\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# We download the dataset from zenodo using wget\n",
    "\n",
    "!wget https://zenodo.org/records/5887964/files/hear2021-speech_commands-v0.0.2-5h-48000.tar.gz?download=1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1b5753fe",
   "metadata": {},
   "outputs": [],
   "source": [
    "# We extract the downloaded tar.gz file and move the contents to the /data directory (folder should exist)\n",
    "!tar -zxf ./hear2021-speech_commands-v0.0.2-5h-48000.tar.gz?download=1 -C /data"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "40ddbb1a",
   "metadata": {},
   "source": [
    "Now the dataset is available at `/data/hear-2021.0.6/tasks/speech_commands-v0.0.2-5h`. The folder contains the following files:\n",
    "\n",
    "- labelvocabulary.csv: Containing the class mapping between class names and integer values.\n",
    "- task_metadata.json: Metadata of the dataset\n",
    "- train.json: The audio filenames corresponding to the training set.\n",
    "- valid.json: The audio filenames corresponding to the validation set.\n",
    "- test.json: The audio filenames corresponding to the test set.\n",
    "\n",
    "The folder `48000` contains three subfolders `train`, `test`, `valid`, each containing the respective audio files of the specified split in 48KHz sampling rate format."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "4633ec79",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'task_name': 'speech_commands',\n",
       " 'version': 'v0.0.2',\n",
       " 'embedding_type': 'scene',\n",
       " 'prediction_type': 'multiclass',\n",
       " 'split_mode': 'trainvaltest',\n",
       " 'sample_duration': 1.0,\n",
       " 'evaluation': ['top1_acc'],\n",
       " 'download_urls': [{'split': 'train',\n",
       "   'url': 'http://download.tensorflow.org/data/speech_commands_v0.02.tar.gz',\n",
       "   'md5': '6b74f3901214cb2c2934e98196829835'},\n",
       "  {'split': 'test',\n",
       "   'url': 'http://download.tensorflow.org/data/speech_commands_test_set_v0.02.tar.gz',\n",
       "   'md5': '854c580ee90bff80c516491c84544e32'}],\n",
       " 'default_mode': '5h',\n",
       " 'max_task_duration_by_split': {'train': 16000.0,\n",
       "  'valid': 2000.0,\n",
       "  'test': None},\n",
       " 'tmp_dir': '_workdir',\n",
       " 'mode': '5h',\n",
       " 'splits': ['train', 'valid', 'test']}"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# We inspect the contests of the medatata file\n",
    "import json\n",
    "from pathlib import Path\n",
    "\n",
    "DATA_PATH = Path(\"/data/hear-2021.0.6/tasks/speech_commands-v0.0.2-5h/\")\n",
    "TRAIN_PATH = DATA_PATH / \"48000\" / \"train\"\n",
    "TEST_PATH = DATA_PATH / \"48000\" / \"test\"\n",
    "VALID_PATH = DATA_PATH / \"48000\" / \"valid\"\n",
    "\n",
    "with open(DATA_PATH / \"task_metadata.json\", \"r\") as f:\n",
    "    metadata = json.load(f)\n",
    "\n",
    "metadata"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1eb95e7b",
   "metadata": {},
   "source": [
    "Through the metadata we see that each audio is 1-second long. Therefore, we will set `segment_duration=1.0` for creating the PyTorch dataset. Below we inspect the format of the json splitting files."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "adc51b1c",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "_silence__doing_the_dishes-1048000.wav ['_silence_']\n"
     ]
    }
   ],
   "source": [
    "with open(DATA_PATH / \"train.json\", \"r\") as f:\n",
    "    train_json = json.load(f)\n",
    "    \n",
    "# Inspect the first entry in the train.json file\n",
    "key, value = next(iter(train_json.items()))\n",
    "\n",
    "print(key, value)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5046c2df",
   "metadata": {},
   "source": [
    "We see that the json maps the filenames to the individual classes. We parse the json files for the validation / test splits in similar manner."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "b3f8bdd9",
   "metadata": {},
   "outputs": [],
   "source": [
    "with open(DATA_PATH / \"test.json\", \"r\") as f:\n",
    "    test_json = json.load(f)\n",
    "    \n",
    "with open(DATA_PATH / \"valid.json\", \"r\") as f:\n",
    "    valid_json = json.load(f)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "118f96a2",
   "metadata": {},
   "source": [
    "## 2. Dataset Creation using Python Dictionaries\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9e09d3f2",
   "metadata": {},
   "source": [
    "Now that we understand the structure of the dataset we can easily create the datasets. We first define the `class_mapping` through the `labelvocabulary.csv` file which is available."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "25f562c0",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'_silence_': 0,\n",
       " '_unknown_': 1,\n",
       " 'down': 2,\n",
       " 'go': 3,\n",
       " 'left': 4,\n",
       " 'no': 5,\n",
       " 'off': 6,\n",
       " 'on': 7,\n",
       " 'right': 8,\n",
       " 'stop': 9,\n",
       " 'up': 10,\n",
       " 'yes': 11}"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import csv\n",
    "\n",
    "with open(DATA_PATH / \"labelvocabulary.csv\", \"r\") as f:\n",
    "    reader = csv.reader(f)\n",
    "    next(reader)  # Skip the header row\n",
    "    label_mapping = {rows[0]: rows[1] for rows in reader}\n",
    "    \n",
    "class_mapping = {v: int(k) for k, v in label_mapping.items()}\n",
    "\n",
    "class_mapping"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "26d7c3ce",
   "metadata": {},
   "source": [
    "To instantiate a PyTorch Dataset for audio classification we use the method `audio_classification_dataset_from_dictionary`. The method expects the same arguments as the `audio_classification_dataset_from_dir` with the exception that instead of a path we provide a Python dictionary of the form `{\"<abs_path_to_file>\": \"class_name\"}`. This is handled by the `file_to_class_mapping` argument. Luckily for us, this information is contained in the `train_json, valid_json`, and `test_json` variables defined previously."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "7ff0d938",
   "metadata": {},
   "outputs": [],
   "source": [
    "from deepaudiox import audio_classification_dataset_from_dictionary\n",
    "\n",
    "# We only need to prepend the absolute path and index the class label for the dataset\n",
    "train_json = {str(TRAIN_PATH / key): value[0] for key, value in train_json.items()}\n",
    "valid_json = {str(VALID_PATH / key): value[0] for key, value in valid_json.items()}\n",
    "test_json = {str(TEST_PATH / key): value[0] for key, value in test_json.items()}\n",
    "\n",
    "train_dset = audio_classification_dataset_from_dictionary(file_to_class_mapping=train_json,\n",
    "                                                          class_mapping=class_mapping,\n",
    "                                                          sample_rate=32000,\n",
    "                                                          segment_duration=1.0)\n",
    "\n",
    "valid_dset = audio_classification_dataset_from_dictionary(file_to_class_mapping=valid_json,\n",
    "                                                          class_mapping=class_mapping,\n",
    "                                                          sample_rate=32000,\n",
    "                                                          segment_duration=1.0)\n",
    "\n",
    "test_dset = audio_classification_dataset_from_dictionary(file_to_class_mapping=test_json,\n",
    "                                                          class_mapping=class_mapping,\n",
    "                                                          sample_rate=32000,\n",
    "                                                          segment_duration=1.0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "008d2d8a",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'path': '/data/hear-2021.0.6/tasks/speech_commands-v0.0.2-5h/48000/train/_silence__doing_the_dishes-1048000.wav', 'y_true': 0, 'class_name': '_silence_', 'segment_idx': 0, 'feature': array([ 0.01144081,  0.00943983,  0.00135719, ..., -0.01853629,\n",
      "       -0.0183027 , -0.0120908 ], shape=(32000,), dtype=float32)}\n"
     ]
    }
   ],
   "source": [
    "# Check the first entry in the training dataset\n",
    "print(train_dset[0])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "9b04bba7",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of training samples: 16000\n",
      "Number of validation samples: 2000\n",
      "Number of test samples: 4890\n"
     ]
    }
   ],
   "source": [
    "# Check the lengths of the datasets\n",
    "print(f\"Number of training samples: {len(train_dset)}\")\n",
    "print(f\"Number of validation samples: {len(valid_dset)}\")\n",
    "print(f\"Number of test samples: {len(test_dset)}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f1ca84a4",
   "metadata": {},
   "source": [
    "## 3. Initializing the AudioClassifier\n",
    "\n",
    "Now the rest is easy. The steps are Classifier Initialization -> Trainer -> Evaluator. We instantiate a simple audio classifier using MobileNet as backbone feature extractor - a lightweight CNN-based architecture enabling fast training. Since the backbone is lightweight we train it from scratch."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "id": "78263c3a",
   "metadata": {},
   "outputs": [],
   "source": [
    "from deepaudiox import AudioClassifier\n",
    "\n",
    "model = AudioClassifier(backbone=\"mobilenet_10_as\",\n",
    "                        num_classes=len(class_mapping),\n",
    "                        freeze_backbone=False,\n",
    "                        pretrained=True,\n",
    "                        sample_rate=32_000)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "56281d84",
   "metadata": {},
   "source": [
    "To see all the available backbones on the library use the `AVAILABLE_BACKBONES` variable lists all backbones."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "id": "5faa9122",
   "metadata": {},
   "outputs": [],
   "source": [
    "from deepaudiox import AVAILABLE_BACKBONES"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "id": "e7f874b8",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "AudioClassifierConstructor(\n",
       "  (backbone_constructor): BackboneConstructor(\n",
       "    (backbone): MobileNet(\n",
       "      (features): Sequential(\n",
       "        (0): Conv2dNormActivation(\n",
       "          (0): Conv2d(1, 16, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)\n",
       "          (1): BatchNorm2d(16, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)\n",
       "          (2): Hardswish()\n",
       "        )\n",
       "        (1): InvertedResidual(\n",
       "          (block): Sequential(\n",
       "            (0): Conv2dNormActivation(\n",
       "              (0): Conv2d(16, 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=16, bias=False)\n",
       "              (1): BatchNorm2d(16, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)\n",
       "              (2): ReLU(inplace=True)\n",
       "            )\n",
       "            (1): Conv2dNormActivation(\n",
       "              (0): Conv2d(16, 16, kernel_size=(1, 1), stride=(1, 1), bias=False)\n",
       "              (1): BatchNorm2d(16, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)\n",
       "            )\n",
       "          )\n",
       "        )\n",
       "        (2): InvertedResidual(\n",
       "          (block): Sequential(\n",
       "            (0): Conv2dNormActivation(\n",
       "              (0): Conv2d(16, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)\n",
       "              (1): BatchNorm2d(64, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)\n",
       "              (2): ReLU(inplace=True)\n",
       "            )\n",
       "            (1): Conv2dNormActivation(\n",
       "              (0): Conv2d(64, 64, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), groups=64, bias=False)\n",
       "              (1): BatchNorm2d(64, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)\n",
       "              (2): ReLU(inplace=True)\n",
       "            )\n",
       "            (2): Conv2dNormActivation(\n",
       "              (0): Conv2d(64, 24, kernel_size=(1, 1), stride=(1, 1), bias=False)\n",
       "              (1): BatchNorm2d(24, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)\n",
       "            )\n",
       "          )\n",
       "        )\n",
       "        (3): InvertedResidual(\n",
       "          (block): Sequential(\n",
       "            (0): Conv2dNormActivation(\n",
       "              (0): Conv2d(24, 72, kernel_size=(1, 1), stride=(1, 1), bias=False)\n",
       "              (1): BatchNorm2d(72, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)\n",
       "              (2): ReLU(inplace=True)\n",
       "            )\n",
       "            (1): Conv2dNormActivation(\n",
       "              (0): Conv2d(72, 72, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=72, bias=False)\n",
       "              (1): BatchNorm2d(72, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)\n",
       "              (2): ReLU(inplace=True)\n",
       "            )\n",
       "            (2): Conv2dNormActivation(\n",
       "              (0): Conv2d(72, 24, kernel_size=(1, 1), stride=(1, 1), bias=False)\n",
       "              (1): BatchNorm2d(24, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)\n",
       "            )\n",
       "          )\n",
       "        )\n",
       "        (4): InvertedResidual(\n",
       "          (block): Sequential(\n",
       "            (0): Conv2dNormActivation(\n",
       "              (0): Conv2d(24, 72, kernel_size=(1, 1), stride=(1, 1), bias=False)\n",
       "              (1): BatchNorm2d(72, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)\n",
       "              (2): ReLU(inplace=True)\n",
       "            )\n",
       "            (1): Conv2dNormActivation(\n",
       "              (0): Conv2d(72, 72, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2), groups=72, bias=False)\n",
       "              (1): BatchNorm2d(72, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)\n",
       "              (2): ReLU(inplace=True)\n",
       "            )\n",
       "            (2): ConcurrentSEBlock(\n",
       "              (conc_se_layers): ModuleList(\n",
       "                (0): SqueezeExcitation(\n",
       "                  (fc1): Linear(in_features=72, out_features=24, bias=True)\n",
       "                  (fc2): Linear(in_features=24, out_features=72, bias=True)\n",
       "                  (activation): ReLU()\n",
       "                  (scale_activation): Sigmoid()\n",
       "                )\n",
       "              )\n",
       "            )\n",
       "            (3): Conv2dNormActivation(\n",
       "              (0): Conv2d(72, 40, kernel_size=(1, 1), stride=(1, 1), bias=False)\n",
       "              (1): BatchNorm2d(40, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)\n",
       "            )\n",
       "          )\n",
       "        )\n",
       "        (5): InvertedResidual(\n",
       "          (block): Sequential(\n",
       "            (0): Conv2dNormActivation(\n",
       "              (0): Conv2d(40, 120, kernel_size=(1, 1), stride=(1, 1), bias=False)\n",
       "              (1): BatchNorm2d(120, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)\n",
       "              (2): ReLU(inplace=True)\n",
       "            )\n",
       "            (1): Conv2dNormActivation(\n",
       "              (0): Conv2d(120, 120, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2), groups=120, bias=False)\n",
       "              (1): BatchNorm2d(120, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)\n",
       "              (2): ReLU(inplace=True)\n",
       "            )\n",
       "            (2): ConcurrentSEBlock(\n",
       "              (conc_se_layers): ModuleList(\n",
       "                (0): SqueezeExcitation(\n",
       "                  (fc1): Linear(in_features=120, out_features=32, bias=True)\n",
       "                  (fc2): Linear(in_features=32, out_features=120, bias=True)\n",
       "                  (activation): ReLU()\n",
       "                  (scale_activation): Sigmoid()\n",
       "                )\n",
       "              )\n",
       "            )\n",
       "            (3): Conv2dNormActivation(\n",
       "              (0): Conv2d(120, 40, kernel_size=(1, 1), stride=(1, 1), bias=False)\n",
       "              (1): BatchNorm2d(40, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)\n",
       "            )\n",
       "          )\n",
       "        )\n",
       "        (6): InvertedResidual(\n",
       "          (block): Sequential(\n",
       "            (0): Conv2dNormActivation(\n",
       "              (0): Conv2d(40, 120, kernel_size=(1, 1), stride=(1, 1), bias=False)\n",
       "              (1): BatchNorm2d(120, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)\n",
       "              (2): ReLU(inplace=True)\n",
       "            )\n",
       "            (1): Conv2dNormActivation(\n",
       "              (0): Conv2d(120, 120, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2), groups=120, bias=False)\n",
       "              (1): BatchNorm2d(120, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)\n",
       "              (2): ReLU(inplace=True)\n",
       "            )\n",
       "            (2): ConcurrentSEBlock(\n",
       "              (conc_se_layers): ModuleList(\n",
       "                (0): SqueezeExcitation(\n",
       "                  (fc1): Linear(in_features=120, out_features=32, bias=True)\n",
       "                  (fc2): Linear(in_features=32, out_features=120, bias=True)\n",
       "                  (activation): ReLU()\n",
       "                  (scale_activation): Sigmoid()\n",
       "                )\n",
       "              )\n",
       "            )\n",
       "            (3): Conv2dNormActivation(\n",
       "              (0): Conv2d(120, 40, kernel_size=(1, 1), stride=(1, 1), bias=False)\n",
       "              (1): BatchNorm2d(40, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)\n",
       "            )\n",
       "          )\n",
       "        )\n",
       "        (7): InvertedResidual(\n",
       "          (block): Sequential(\n",
       "            (0): Conv2dNormActivation(\n",
       "              (0): Conv2d(40, 240, kernel_size=(1, 1), stride=(1, 1), bias=False)\n",
       "              (1): BatchNorm2d(240, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)\n",
       "              (2): Hardswish()\n",
       "            )\n",
       "            (1): Conv2dNormActivation(\n",
       "              (0): Conv2d(240, 240, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), groups=240, bias=False)\n",
       "              (1): BatchNorm2d(240, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)\n",
       "              (2): Hardswish()\n",
       "            )\n",
       "            (2): Conv2dNormActivation(\n",
       "              (0): Conv2d(240, 80, kernel_size=(1, 1), stride=(1, 1), bias=False)\n",
       "              (1): BatchNorm2d(80, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)\n",
       "            )\n",
       "          )\n",
       "        )\n",
       "        (8): InvertedResidual(\n",
       "          (block): Sequential(\n",
       "            (0): Conv2dNormActivation(\n",
       "              (0): Conv2d(80, 200, kernel_size=(1, 1), stride=(1, 1), bias=False)\n",
       "              (1): BatchNorm2d(200, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)\n",
       "              (2): Hardswish()\n",
       "            )\n",
       "            (1): Conv2dNormActivation(\n",
       "              (0): Conv2d(200, 200, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=200, bias=False)\n",
       "              (1): BatchNorm2d(200, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)\n",
       "              (2): Hardswish()\n",
       "            )\n",
       "            (2): Conv2dNormActivation(\n",
       "              (0): Conv2d(200, 80, kernel_size=(1, 1), stride=(1, 1), bias=False)\n",
       "              (1): BatchNorm2d(80, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)\n",
       "            )\n",
       "          )\n",
       "        )\n",
       "        (9): InvertedResidual(\n",
       "          (block): Sequential(\n",
       "            (0): Conv2dNormActivation(\n",
       "              (0): Conv2d(80, 184, kernel_size=(1, 1), stride=(1, 1), bias=False)\n",
       "              (1): BatchNorm2d(184, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)\n",
       "              (2): Hardswish()\n",
       "            )\n",
       "            (1): Conv2dNormActivation(\n",
       "              (0): Conv2d(184, 184, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=184, bias=False)\n",
       "              (1): BatchNorm2d(184, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)\n",
       "              (2): Hardswish()\n",
       "            )\n",
       "            (2): Conv2dNormActivation(\n",
       "              (0): Conv2d(184, 80, kernel_size=(1, 1), stride=(1, 1), bias=False)\n",
       "              (1): BatchNorm2d(80, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)\n",
       "            )\n",
       "          )\n",
       "        )\n",
       "        (10): InvertedResidual(\n",
       "          (block): Sequential(\n",
       "            (0): Conv2dNormActivation(\n",
       "              (0): Conv2d(80, 184, kernel_size=(1, 1), stride=(1, 1), bias=False)\n",
       "              (1): BatchNorm2d(184, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)\n",
       "              (2): Hardswish()\n",
       "            )\n",
       "            (1): Conv2dNormActivation(\n",
       "              (0): Conv2d(184, 184, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=184, bias=False)\n",
       "              (1): BatchNorm2d(184, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)\n",
       "              (2): Hardswish()\n",
       "            )\n",
       "            (2): Conv2dNormActivation(\n",
       "              (0): Conv2d(184, 80, kernel_size=(1, 1), stride=(1, 1), bias=False)\n",
       "              (1): BatchNorm2d(80, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)\n",
       "            )\n",
       "          )\n",
       "        )\n",
       "        (11): InvertedResidual(\n",
       "          (block): Sequential(\n",
       "            (0): Conv2dNormActivation(\n",
       "              (0): Conv2d(80, 480, kernel_size=(1, 1), stride=(1, 1), bias=False)\n",
       "              (1): BatchNorm2d(480, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)\n",
       "              (2): Hardswish()\n",
       "            )\n",
       "            (1): Conv2dNormActivation(\n",
       "              (0): Conv2d(480, 480, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=480, bias=False)\n",
       "              (1): BatchNorm2d(480, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)\n",
       "              (2): Hardswish()\n",
       "            )\n",
       "            (2): ConcurrentSEBlock(\n",
       "              (conc_se_layers): ModuleList(\n",
       "                (0): SqueezeExcitation(\n",
       "                  (fc1): Linear(in_features=480, out_features=120, bias=True)\n",
       "                  (fc2): Linear(in_features=120, out_features=480, bias=True)\n",
       "                  (activation): ReLU()\n",
       "                  (scale_activation): Sigmoid()\n",
       "                )\n",
       "              )\n",
       "            )\n",
       "            (3): Conv2dNormActivation(\n",
       "              (0): Conv2d(480, 112, kernel_size=(1, 1), stride=(1, 1), bias=False)\n",
       "              (1): BatchNorm2d(112, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)\n",
       "            )\n",
       "          )\n",
       "        )\n",
       "        (12): InvertedResidual(\n",
       "          (block): Sequential(\n",
       "            (0): Conv2dNormActivation(\n",
       "              (0): Conv2d(112, 672, kernel_size=(1, 1), stride=(1, 1), bias=False)\n",
       "              (1): BatchNorm2d(672, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)\n",
       "              (2): Hardswish()\n",
       "            )\n",
       "            (1): Conv2dNormActivation(\n",
       "              (0): Conv2d(672, 672, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=672, bias=False)\n",
       "              (1): BatchNorm2d(672, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)\n",
       "              (2): Hardswish()\n",
       "            )\n",
       "            (2): ConcurrentSEBlock(\n",
       "              (conc_se_layers): ModuleList(\n",
       "                (0): SqueezeExcitation(\n",
       "                  (fc1): Linear(in_features=672, out_features=168, bias=True)\n",
       "                  (fc2): Linear(in_features=168, out_features=672, bias=True)\n",
       "                  (activation): ReLU()\n",
       "                  (scale_activation): Sigmoid()\n",
       "                )\n",
       "              )\n",
       "            )\n",
       "            (3): Conv2dNormActivation(\n",
       "              (0): Conv2d(672, 112, kernel_size=(1, 1), stride=(1, 1), bias=False)\n",
       "              (1): BatchNorm2d(112, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)\n",
       "            )\n",
       "          )\n",
       "        )\n",
       "        (13): InvertedResidual(\n",
       "          (block): Sequential(\n",
       "            (0): Conv2dNormActivation(\n",
       "              (0): Conv2d(112, 672, kernel_size=(1, 1), stride=(1, 1), bias=False)\n",
       "              (1): BatchNorm2d(672, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)\n",
       "              (2): Hardswish()\n",
       "            )\n",
       "            (1): Conv2dNormActivation(\n",
       "              (0): Conv2d(672, 672, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2), groups=672, bias=False)\n",
       "              (1): BatchNorm2d(672, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)\n",
       "              (2): Hardswish()\n",
       "            )\n",
       "            (2): ConcurrentSEBlock(\n",
       "              (conc_se_layers): ModuleList(\n",
       "                (0): SqueezeExcitation(\n",
       "                  (fc1): Linear(in_features=672, out_features=168, bias=True)\n",
       "                  (fc2): Linear(in_features=168, out_features=672, bias=True)\n",
       "                  (activation): ReLU()\n",
       "                  (scale_activation): Sigmoid()\n",
       "                )\n",
       "              )\n",
       "            )\n",
       "            (3): Conv2dNormActivation(\n",
       "              (0): Conv2d(672, 160, kernel_size=(1, 1), stride=(1, 1), bias=False)\n",
       "              (1): BatchNorm2d(160, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)\n",
       "            )\n",
       "          )\n",
       "        )\n",
       "        (14): InvertedResidual(\n",
       "          (block): Sequential(\n",
       "            (0): Conv2dNormActivation(\n",
       "              (0): Conv2d(160, 960, kernel_size=(1, 1), stride=(1, 1), bias=False)\n",
       "              (1): BatchNorm2d(960, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)\n",
       "              (2): Hardswish()\n",
       "            )\n",
       "            (1): Conv2dNormActivation(\n",
       "              (0): Conv2d(960, 960, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2), groups=960, bias=False)\n",
       "              (1): BatchNorm2d(960, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)\n",
       "              (2): Hardswish()\n",
       "            )\n",
       "            (2): ConcurrentSEBlock(\n",
       "              (conc_se_layers): ModuleList(\n",
       "                (0): SqueezeExcitation(\n",
       "                  (fc1): Linear(in_features=960, out_features=240, bias=True)\n",
       "                  (fc2): Linear(in_features=240, out_features=960, bias=True)\n",
       "                  (activation): ReLU()\n",
       "                  (scale_activation): Sigmoid()\n",
       "                )\n",
       "              )\n",
       "            )\n",
       "            (3): Conv2dNormActivation(\n",
       "              (0): Conv2d(960, 160, kernel_size=(1, 1), stride=(1, 1), bias=False)\n",
       "              (1): BatchNorm2d(160, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)\n",
       "            )\n",
       "          )\n",
       "        )\n",
       "        (15): InvertedResidual(\n",
       "          (block): Sequential(\n",
       "            (0): Conv2dNormActivation(\n",
       "              (0): Conv2d(160, 960, kernel_size=(1, 1), stride=(1, 1), bias=False)\n",
       "              (1): BatchNorm2d(960, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)\n",
       "              (2): Hardswish()\n",
       "            )\n",
       "            (1): Conv2dNormActivation(\n",
       "              (0): Conv2d(960, 960, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2), groups=960, bias=False)\n",
       "              (1): BatchNorm2d(960, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)\n",
       "              (2): Hardswish()\n",
       "            )\n",
       "            (2): ConcurrentSEBlock(\n",
       "              (conc_se_layers): ModuleList(\n",
       "                (0): SqueezeExcitation(\n",
       "                  (fc1): Linear(in_features=960, out_features=240, bias=True)\n",
       "                  (fc2): Linear(in_features=240, out_features=960, bias=True)\n",
       "                  (activation): ReLU()\n",
       "                  (scale_activation): Sigmoid()\n",
       "                )\n",
       "              )\n",
       "            )\n",
       "            (3): Conv2dNormActivation(\n",
       "              (0): Conv2d(960, 160, kernel_size=(1, 1), stride=(1, 1), bias=False)\n",
       "              (1): BatchNorm2d(160, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)\n",
       "            )\n",
       "          )\n",
       "        )\n",
       "        (16): Conv2dNormActivation(\n",
       "          (0): Conv2d(160, 960, kernel_size=(1, 1), stride=(1, 1), bias=False)\n",
       "          (1): BatchNorm2d(960, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)\n",
       "          (2): Hardswish()\n",
       "        )\n",
       "      )\n",
       "      (feature_extractor): AugmentMelSTFT(\n",
       "        (freqm): FrequencyMasking()\n",
       "        (timem): TimeMasking()\n",
       "      )\n",
       "    )\n",
       "    (pooling): GAP()\n",
       "  )\n",
       "  (classifier): MLPHead(\n",
       "    (model): Sequential(\n",
       "      (0): Linear(in_features=960, out_features=12, bias=True)\n",
       "    )\n",
       "  )\n",
       ")"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Model Inspection\n",
    "model"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "05032396",
   "metadata": {},
   "source": [
    "## 4. Training\n",
    "\n",
    "Now we are ready to train our model for speech command classification. Note that in this case, the dataset comes with a predetermined validation dataset where we can utilize during training."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "id": "569f9461",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[Epoch 1/50]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Using GPU: NVIDIA GeForce RTX 4090\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Epoch 1 | Train Loss: 1.5644 | Val. Loss: 1.5606 | Time: 3.32s      \n",
      "[CHECKPOINTER] Validation loss decreased: (inf --> 1.560594), \u001b[92m(-nan%)\u001b[0m.\n",
      "[CHECKPOINTER] Checkpoint saved successfully at: checkpoint.pt\n",
      "[Epoch 2/50]\n",
      "Epoch 2 | Train Loss: 1.4823 | Val. Loss: 0.9147 | Time: 2.62s      \n",
      "[CHECKPOINTER] Validation loss decreased: (1.560594 --> 0.914667), \u001b[92m(-41.39%)\u001b[0m.\n",
      "[CHECKPOINTER] Checkpoint saved successfully at: checkpoint.pt\n",
      "[Epoch 3/50]\n",
      "Epoch 3 | Train Loss: 1.3784 | Val. Loss: 1.4767 | Time: 2.66s      \n",
      "[Epoch 4/50]\n",
      "Epoch 4 | Train Loss: 1.3140 | Val. Loss: 0.4436 | Time: 2.60s      \n",
      "[CHECKPOINTER] Validation loss decreased: (0.914667 --> 0.443601), \u001b[92m(-51.50%)\u001b[0m.\n",
      "[CHECKPOINTER] Checkpoint saved successfully at: checkpoint.pt\n",
      "[Epoch 5/50]\n",
      "Epoch 5 | Train Loss: 1.3024 | Val. Loss: 0.3455 | Time: 2.65s      \n",
      "[CHECKPOINTER] Validation loss decreased: (0.443601 --> 0.345517), \u001b[92m(-22.11%)\u001b[0m.\n",
      "[CHECKPOINTER] Checkpoint saved successfully at: checkpoint.pt\n",
      "[Epoch 6/50]\n",
      "Epoch 6 | Train Loss: 1.2830 | Val. Loss: 0.4031 | Time: 2.65s      \n",
      "[Epoch 7/50]\n",
      "Epoch 7 | Train Loss: 1.2611 | Val. Loss: 0.2845 | Time: 2.62s      \n",
      "[CHECKPOINTER] Validation loss decreased: (0.345517 --> 0.284467), \u001b[92m(-17.67%)\u001b[0m.\n",
      "[CHECKPOINTER] Checkpoint saved successfully at: checkpoint.pt\n",
      "[Epoch 8/50]\n",
      "Epoch 8 | Train Loss: 1.2377 | Val. Loss: 0.2525 | Time: 2.67s      \n",
      "[CHECKPOINTER] Validation loss decreased: (0.284467 --> 0.252507), \u001b[92m(-11.23%)\u001b[0m.\n",
      "[CHECKPOINTER] Checkpoint saved successfully at: checkpoint.pt\n",
      "[Epoch 9/50]\n",
      "Epoch 9 | Train Loss: 1.2382 | Val. Loss: 0.2880 | Time: 2.68s      \n",
      "[Epoch 10/50]\n",
      "Epoch 10 | Train Loss: 1.2242 | Val. Loss: 0.2928 | Time: 2.65s     \n",
      "[Epoch 11/50]\n",
      "Epoch 11 | Train Loss: 1.2149 | Val. Loss: 0.2199 | Time: 2.61s     \n",
      "[CHECKPOINTER] Validation loss decreased: (0.252507 --> 0.219867), \u001b[92m(-12.93%)\u001b[0m.\n",
      "[CHECKPOINTER] Checkpoint saved successfully at: checkpoint.pt\n",
      "[Epoch 12/50]\n",
      "Epoch 12 | Train Loss: 1.2129 | Val. Loss: 0.2184 | Time: 2.68s     \n",
      "[CHECKPOINTER] Validation loss decreased: (0.219867 --> 0.218392), \u001b[92m(-0.67%)\u001b[0m.\n",
      "[CHECKPOINTER] Checkpoint saved successfully at: checkpoint.pt\n",
      "[Epoch 13/50]\n",
      "Epoch 13 | Train Loss: 1.2014 | Val. Loss: 0.1860 | Time: 2.70s     \n",
      "[CHECKPOINTER] Validation loss decreased: (0.218392 --> 0.185957), \u001b[92m(-14.85%)\u001b[0m.\n",
      "[CHECKPOINTER] Checkpoint saved successfully at: checkpoint.pt\n",
      "[Epoch 14/50]\n",
      "Epoch 14 | Train Loss: 1.2113 | Val. Loss: 0.1765 | Time: 2.67s     \n",
      "[CHECKPOINTER] Validation loss decreased: (0.185957 --> 0.176475), \u001b[92m(-5.10%)\u001b[0m.\n",
      "[CHECKPOINTER] Checkpoint saved successfully at: checkpoint.pt\n",
      "[Epoch 15/50]\n",
      "Epoch 15 | Train Loss: 1.1968 | Val. Loss: 0.1685 | Time: 2.74s     \n",
      "[CHECKPOINTER] Validation loss decreased: (0.176475 --> 0.168494), \u001b[92m(-4.52%)\u001b[0m.\n",
      "[CHECKPOINTER] Checkpoint saved successfully at: checkpoint.pt\n",
      "[Epoch 16/50]\n",
      "Epoch 16 | Train Loss: 1.2047 | Val. Loss: 0.2046 | Time: 2.72s     \n",
      "[Epoch 17/50]\n",
      "Epoch 17 | Train Loss: 1.1981 | Val. Loss: 0.1594 | Time: 2.68s     \n",
      "[CHECKPOINTER] Validation loss decreased: (0.168494 --> 0.159420), \u001b[92m(-5.39%)\u001b[0m.\n",
      "[CHECKPOINTER] Checkpoint saved successfully at: checkpoint.pt\n",
      "[Epoch 18/50]\n",
      "Epoch 18 | Train Loss: 1.1948 | Val. Loss: 0.1701 | Time: 2.69s     \n",
      "[Epoch 19/50]\n",
      "Epoch 19 | Train Loss: 1.1924 | Val. Loss: 0.1541 | Time: 2.63s     \n",
      "[CHECKPOINTER] Validation loss decreased: (0.159420 --> 0.154148), \u001b[92m(-3.31%)\u001b[0m.\n",
      "[CHECKPOINTER] Checkpoint saved successfully at: checkpoint.pt\n",
      "[Epoch 20/50]\n",
      "Epoch 20 | Train Loss: 1.1851 | Val. Loss: 0.1478 | Time: 2.68s     \n",
      "[CHECKPOINTER] Validation loss decreased: (0.154148 --> 0.147844), \u001b[92m(-4.09%)\u001b[0m.\n",
      "[CHECKPOINTER] Checkpoint saved successfully at: checkpoint.pt\n",
      "[Epoch 21/50]\n",
      "Epoch 21 | Train Loss: 1.1814 | Val. Loss: 0.1905 | Time: 2.67s     \n",
      "[Epoch 22/50]\n",
      "Epoch 22 | Train Loss: 1.1673 | Val. Loss: 0.1482 | Time: 2.60s     \n",
      "[Epoch 23/50]\n",
      "Epoch 23 | Train Loss: 1.1719 | Val. Loss: 0.1611 | Time: 2.66s     \n",
      "[Epoch 24/50]\n",
      "Epoch 24 | Train Loss: 1.1771 | Val. Loss: 0.1800 | Time: 2.65s     \n",
      "[Epoch 25/50]\n",
      "Epoch 25 | Train Loss: 1.1650 | Val. Loss: 0.1583 | Time: 2.67s     \n",
      "[Epoch 26/50]\n",
      "Epoch 26 | Train Loss: 1.1611 | Val. Loss: 0.1416 | Time: 2.63s     \n",
      "[CHECKPOINTER] Validation loss decreased: (0.147844 --> 0.141557), \u001b[92m(-4.25%)\u001b[0m.\n",
      "[CHECKPOINTER] Checkpoint saved successfully at: checkpoint.pt\n",
      "[Epoch 27/50]\n",
      "Epoch 27 | Train Loss: 1.1593 | Val. Loss: 0.1402 | Time: 2.68s     \n",
      "[CHECKPOINTER] Validation loss decreased: (0.141557 --> 0.140151), \u001b[92m(-0.99%)\u001b[0m.\n",
      "[CHECKPOINTER] Checkpoint saved successfully at: checkpoint.pt\n",
      "[Epoch 28/50]\n",
      "Epoch 28 | Train Loss: 1.1600 | Val. Loss: 0.1407 | Time: 2.70s     \n",
      "[Epoch 29/50]\n",
      "Epoch 29 | Train Loss: 1.1700 | Val. Loss: 0.1624 | Time: 2.67s     \n",
      "[Epoch 30/50]\n",
      "Epoch 30 | Train Loss: 1.1484 | Val. Loss: 0.1412 | Time: 2.63s     \n",
      "[Epoch 31/50]\n",
      "Epoch 31 | Train Loss: 1.1673 | Val. Loss: 0.1415 | Time: 2.63s     \n",
      "[Epoch 32/50]\n",
      "Epoch 32 | Train Loss: 1.1712 | Val. Loss: 0.1409 | Time: 2.63s     \n",
      "[Epoch 33/50]\n",
      "Epoch 33 | Train Loss: 1.1398 | Val. Loss: 0.1438 | Time: 2.68s     \n",
      "[Epoch 34/50]\n",
      "Epoch 34 | Train Loss: 1.1542 | Val. Loss: 0.1260 | Time: 2.64s     \n",
      "[CHECKPOINTER] Validation loss decreased: (0.140151 --> 0.126029), \u001b[92m(-10.08%)\u001b[0m.\n",
      "[CHECKPOINTER] Checkpoint saved successfully at: checkpoint.pt\n",
      "[Epoch 35/50]\n",
      "Epoch 35 | Train Loss: 1.1653 | Val. Loss: 0.1367 | Time: 2.70s     \n",
      "[Epoch 36/50]\n",
      "Epoch 36 | Train Loss: 1.1419 | Val. Loss: 0.1463 | Time: 2.66s     \n",
      "[Epoch 37/50]\n",
      "Epoch 37 | Train Loss: 1.1602 | Val. Loss: 0.1179 | Time: 2.64s     \n",
      "[CHECKPOINTER] Validation loss decreased: (0.126029 --> 0.117883), \u001b[92m(-6.46%)\u001b[0m.\n",
      "[CHECKPOINTER] Checkpoint saved successfully at: checkpoint.pt\n",
      "[Epoch 38/50]\n",
      "Epoch 38 | Train Loss: 1.1494 | Val. Loss: 0.1095 | Time: 2.69s     \n",
      "[CHECKPOINTER] Validation loss decreased: (0.117883 --> 0.109453), \u001b[92m(-7.15%)\u001b[0m.\n",
      "[CHECKPOINTER] Checkpoint saved successfully at: checkpoint.pt\n",
      "[Epoch 39/50]\n",
      "Epoch 39 | Train Loss: 1.1425 | Val. Loss: 0.1517 | Time: 2.70s     \n",
      "[Epoch 40/50]\n",
      "Epoch 40 | Train Loss: 1.1475 | Val. Loss: 0.1156 | Time: 2.64s     \n",
      "[Epoch 41/50]\n",
      "Epoch 41 | Train Loss: 1.1372 | Val. Loss: 0.2005 | Time: 2.67s     \n",
      "[Epoch 42/50]\n",
      "Epoch 42 | Train Loss: 1.1491 | Val. Loss: 0.1219 | Time: 2.64s     \n",
      "[Epoch 43/50]\n",
      "Epoch 43 | Train Loss: 1.1490 | Val. Loss: 0.1205 | Time: 2.67s     \n",
      "[Epoch 44/50]\n",
      "Epoch 44 | Train Loss: 1.1481 | Val. Loss: 0.1335 | Time: 2.68s     \n",
      "[Epoch 45/50]\n",
      "Epoch 45 | Train Loss: 1.1458 | Val. Loss: 0.1184 | Time: 2.66s     \n",
      "[Epoch 46/50]\n",
      "Epoch 46 | Train Loss: 1.1441 | Val. Loss: 0.1129 | Time: 2.68s     \n",
      "[EARLY STOPPING] Elapsed epochs: 8 out of 10\n",
      "[Epoch 47/50]\n",
      "Epoch 47 | Train Loss: 1.1367 | Val. Loss: 0.1086 | Time: 2.66s     \n",
      "[CHECKPOINTER] Validation loss decreased: (0.109453 --> 0.108631), \u001b[92m(-0.75%)\u001b[0m.\n",
      "[CHECKPOINTER] Checkpoint saved successfully at: checkpoint.pt\n",
      "[Epoch 48/50]\n",
      "Epoch 48 | Train Loss: 1.1329 | Val. Loss: 0.1046 | Time: 2.69s     \n",
      "[CHECKPOINTER] Validation loss decreased: (0.108631 --> 0.104579), \u001b[92m(-3.73%)\u001b[0m.\n",
      "[CHECKPOINTER] Checkpoint saved successfully at: checkpoint.pt\n",
      "[Epoch 49/50]\n",
      "Epoch 49 | Train Loss: 1.1461 | Val. Loss: 0.1272 | Time: 2.71s     \n",
      "[Epoch 50/50]\n",
      "Epoch 50 | Train Loss: 1.1353 | Val. Loss: 0.1148 | Time: 2.65s     \n",
      "Training has finished.\n"
     ]
    }
   ],
   "source": [
    "from deepaudiox import Trainer\n",
    "\n",
    "trainer = Trainer(model=model,\n",
    "                  train_dset=train_dset,\n",
    "                  validation_dset=valid_dset,\n",
    "                  epochs=50,\n",
    "                  batch_size=128,\n",
    "                  patience=10)\n",
    "\n",
    "trainer.train()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "106717f5",
   "metadata": {},
   "source": [
    "## 5. Evaluation\n",
    "\n",
    "In similar manner as in the first tutorial, we use the `Evaluator` to check the performance on the held-out test set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "id": "41e53110",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Using GPU: NVIDIA GeForce RTX 4090\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Testing has finished.                                                  \n",
      "[REPORTER] Class mapping: {'_silence_': 0, '_unknown_': 1, 'down': 2, 'go': 3, 'left': 4, 'no': 5, 'off': 6, 'on': 7, 'right': 8, 'stop': 9, 'up': 10, 'yes': 11} \n",
      "\n",
      "[REPORTER] Classification Report: \n",
      "\n",
      "              precision    recall  f1-score   support\n",
      "\n",
      "   _silence_       1.00      0.94      0.97       408\n",
      "   _unknown_       0.60      0.99      0.75       408\n",
      "        down       0.96      0.85      0.90       406\n",
      "          go       0.98      0.80      0.88       402\n",
      "        left       0.98      0.92      0.95       412\n",
      "          no       0.91      0.93      0.92       405\n",
      "         off       0.97      0.92      0.95       402\n",
      "          on       0.99      0.87      0.93       396\n",
      "       right       1.00      0.93      0.96       396\n",
      "        stop       1.00      0.99      0.99       411\n",
      "          up       0.97      0.96      0.97       425\n",
      "         yes       0.99      0.98      0.99       419\n",
      "\n",
      "    accuracy                           0.92      4890\n",
      "   macro avg       0.95      0.92      0.93      4890\n",
      "weighted avg       0.95      0.92      0.93      4890\n",
      "\n",
      "[REPORTER] Confusion Matrix: \n",
      "\n",
      "[[385  23   0   0   0   0   0   0   0   0   0   0]\n",
      " [  0 405   0   1   0   2   0   0   0   0   0   0]\n",
      " [  0  42 347   1   1  15   0   0   0   0   0   0]\n",
      " [  0  44  13 320   2  21   0   0   0   0   2   0]\n",
      " [  0  29   0   0 379   0   0   0   0   0   1   3]\n",
      " [  0  21   3   3   0 378   0   0   0   0   0   0]\n",
      " [  0  20   0   0   0   0 371   3   0   1   7   0]\n",
      " [  0  42   0   0   0   0   8 345   0   0   1   0]\n",
      " [  0  27   0   0   2   0   0   0 367   0   0   0]\n",
      " [  0   4   0   0   0   0   0   0   0 407   0   0]\n",
      " [  0  15   0   0   0   0   3   0   0   0 407   0]\n",
      " [  0   5   0   0   1   0   1   0   0   0   0 412]]\n",
      "[REPORTER] Average Posteriors: \n",
      "\n",
      "_silence_           : 0.987\n",
      "_unknown_           : 0.989\n",
      "down                : 0.967\n",
      "go                  : 0.926\n",
      "left                : 0.977\n",
      "no                  : 0.951\n",
      "off                 : 0.978\n",
      "on                  : 0.961\n",
      "right               : 0.975\n",
      "stop                : 0.997\n",
      "up                  : 0.982\n",
      "yes                 : 0.994\n"
     ]
    }
   ],
   "source": [
    "from deepaudiox import Evaluator\n",
    "\n",
    "# First load the best model checkpoint\n",
    "model = AudioClassifier.from_checkpoint(\"checkpoint.pt\")\n",
    "\n",
    "evaluator = Evaluator(model=model, test_dset=test_dset, class_mapping=class_mapping)\n",
    "\n",
    "evaluator.evaluate() "
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "deepaudio-x (3.13.9)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.13.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}